Last time we translated a story about the data generating process into a statistical model. In general, a model looks like this:
\[y = \beta_0 + \beta_1 x + \epsilon\]
Because \(y = \beta_0 + \beta_1 x\) is a linear equation, adding the error term \(\epsilon\) turns it into a linear model.
But what kind of error?
If the error term absorbs the effect of every explanatory variable beyond the \(x\) we've included in the model, what does the sum of all those effects look like?
We have to assume \(\epsilon\) is distributed a certain way, typically:
\[\epsilon \sim Normal(0, 1)\]
The normal distribution, like every other probability distribution, has a few essential components: the set of values it can take on, parameters that control its shape (for the normal, the mean and standard deviation), and a function that assigns a probability to each value.
The normal distribution is one of many probability distributions. Here’s the uniform distribution.
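As a quick sketch of the difference, we can draw samples from both distributions and compare their shapes (the sample size and parameter values here are arbitrary):

# 1,000 draws from a normal distribution: values cluster around the mean.
set.seed(42)
normal_draws <- rnorm(1000, mean = 0, sd = 1)

# 1,000 draws from a uniform distribution: every value in [0, 1] is equally likely.
uniform_draws <- runif(1000, min = 0, max = 1)

# Histograms make the bell shape vs. the flat shape easy to see.
hist(normal_draws, breaks = 30)
hist(uniform_draws, breaks = 30)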
Okay, but why should we use a linear model and assume \(\epsilon \sim Normal(0, 1)\)?
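One common justification is the central limit theorem: the sum of many small, independent effects tends toward a normal distribution, regardless of how each individual effect is distributed. A quick simulation sketch (the number and size of the effects here are hypothetical) illustrates this:

# Each value is the sum of 50 small, independent effects,
# each drawn from a (decidedly non-normal) uniform distribution.
set.seed(42)
effect_sums <- replicate(1000, sum(runif(50, min = -1, max = 1)))

# Despite the uniform pieces, the sums pile up in a bell shape.
hist(effect_sums, breaks = 30)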
A linear model with normal error and a continuous outcome \(y\) is known as a regression.
We now have a complete statistical model of the data generating process:
\[y = \beta_0 + \beta_1 x + \epsilon, \text{ where } \epsilon \sim Normal(0, 1)\]
Remember that the data generating process is the unobserved process that generates the data. Before we work with real data, where we never know the data generating process, we can assume that our model *is* the data generating process, choose values for the parameters, and use the model to generate, or simulate, data.
Why? To start with, we can prepare our data analysis before getting real data.
How do we simulate data from this?
\[y = \beta_0 + \beta_1 x + \epsilon, \text{ where } \epsilon \sim Normal(0, 1)\]
# Load packages.
library(tidyverse)

# Set the randomization seed to simulate the same data each time.
set.seed(42)

# Set the parameter values.
beta0 <- 3
beta1 <- 7

# Simulate data.
sim_data <- tibble(
  x = runif(100, min = 0, max = 7),
  y = beta0 + beta1 * x + rnorm(100, mean = 0, sd = 3)
)
Whenever we are using randomization, for example, when simulating data or running a model that relies on randomization, we want to use set.seed() so we can get the same results every time we render.
# Random numbers are, well, random.
rnorm(1)
## [1] -0.04069848
rnorm(1)
## [1] -1.551545
rnorm(1)
## [1] 1.16717
# But using set.seed() we at least start at the *same* random number each time.
set.seed(42)
rnorm(1)
## [1] 1.370958
set.seed(42)
rnorm(1)
## [1] 1.370958
set.seed(42)
rnorm(1)
## [1] 1.370958
sim_data |>
  ggplot(aes(x = x, y = y)) +
  geom_point()
Okay, what have we talked about so far?
Our goal is to use the model to estimate the unobserved parameters from the data (i.e., make our best guess).
In other words, an inferential model extracts parameter estimates from the data to inform our managerial decision.
The best line should be the one that makes the sum of the vertical bars as small as possible.
The vertical bars are called residuals, and represent the distance between the data \(y\) and a particular line.
Residuals can be positive and negative, so we make the sum of the squared residuals as small as possible.
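As a sketch of this idea, we can compute the sum of squared residuals for a couple of candidate lines through the simulated data; the candidate intercepts and slopes below are hypothetical, and a line near the true parameters should score lower than one far from them.

# Sum of squared residuals for a candidate line with intercept b0 and slope b1.
sum_sq_resid <- function(b0, b1, data) {
  residuals <- data$y - (b0 + b1 * data$x)
  sum(residuals^2)
}

# A line near the true parameters (beta0 = 3, beta1 = 7) should have a
# smaller sum of squared residuals than a line far from them.
sum_sq_resid(3, 7, sim_data)
sum_sq_resid(0, 1, sim_data)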
The tidymodels framework is a collection of packages for modeling using tidyverse principles. The goal is to provide the same consistency and ease-of-use for modeling that the tidyverse provides for importing, wrangling, visualizing, and reporting data.
# Load packages.
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom        1.0.7     ✔ rsample      1.2.1
## ✔ dials        1.3.0     ✔ tune         1.2.1
## ✔ infer        1.0.7     ✔ workflows    1.1.4
## ✔ modeldata    1.4.0     ✔ workflowsets 1.1.0
## ✔ parsnip      1.2.1     ✔ yardstick    1.3.2
## ✔ recipes      1.1.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages
In the tidymodels framework, we specify the model type and then set the engine we’d like to estimate the model with. The engine is either another R package that likely requires its own unique syntax or an entirely different modeling language.
This allows us to use a variety of modeling packages and languages while keeping the same syntax.
# Specify the model type and engine.
linear_reg() |>
set_engine("lm")
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm
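To connect the specification back to the simulated data, we can fit it with fit() and extract the parameter estimates with tidy(); since we simulated the data ourselves, the estimates should land near the true beta0 = 3 and beta1 = 7 (assuming sim_data from the simulation above is in the environment).

# Fit the specified model to the simulated data.
fit_lm <- linear_reg() |>
  set_engine("lm") |>
  fit(y ~ x, data = sim_data)

# Extract the parameter estimates in a tidy tibble.
tidy(fit_lm)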